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Abstract 

Protein representation and potential function are two 
important ingredients for studying proteins folding, 
equilibrium thermodynamics, and sequence design. 
We introduce a novel geometric representation of 
protein contact interactions using the edge simplices 
from alpha shape of protein structure. This represen- 
tation can eliminate implausible neighbors that are 
not in physical contact, and can avoid spurious con- 
tact between two residues when a third residue is 
between them. We develop statistical alpha contact 
potential using an odds-ratio model. A studentized 
bootstrap method is then introduced for assessing the 
95% confidence intervals for each of the 210 propen- 
sity parameters. We found with confidence that there 
is significant long range propensity (> 30 residues 
apart) for hydrophobic interactions. We test alpha 
contact potential for native structure discrimination 
using several sets of decoy structures, and found it of- 
ten has comparable performance with atom-based po- 
tentials requiring many more parameters. We also 
show that accurate geometric representation is impor- 
tant, and alpha contact potential has better perfor- 
mance than potential defined by cut-off distance be- 
tween geometric centers of side chains. Hierarchical 
clustering of alpha contact potentials reveals natural 
grouping of residues. To explore the relationship be- 
tween shape representation and physicochemical rep- 
resentation, we test the minimum alphabet size nec- 
essary for native structure discrimination. We found 
that there is no significant difference in performance 
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of discrimination when alphabet size varies from 7 
to 20, if geometry is represented accurately by alpha 
simplicial edges. This result suggests that the geome- 
try of packing plays an important role, but the specific 
residue types are often interchangeable. 

Keywords: Simplicial edge; alpha contact; alpha 
shape; protein potential function; bootstrap; 



1 Introduction 

Potential function plays important roles for a vari- 
ety of computational studies of proteins, including 
prediction of protein structures, characterization of 
ensemble thermodynamic properties of proteins, and 
design of novel proteins. For example, an essential 
requirement for the prediction of three-dimensional 
structure of protein from primary sequence is a po- 
tential function which can select the native conforma- 
tion from an ensemble of alternative conformations. 
Potential function is also often used to guide the sam- 
pling of protein conformations [1]. A variety of poten- 
tial functions have been developed for these impor- 
tant tasks, including physical model based potentials 
[2-4], empirical statistical potentials [5-7], and em- 
pirical potentials obtained from optimization [8-12]. 

The effectiveness of potential function depends on 
another critically important factor, i.e., the represen- 
tation of protein structures. Within this framework, 
we explore a new type of pairwise contact potentials 
using a novel contact definition. We introduce a con- 
tact definition that reflects protein geometry more 
accurately. These contacts are based on the computa- 
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tion of the alpha shape, or the dual simplicial complex 
description of protein structures [13,14]. Here con- 
tact occurs if atoms from non-bonded residues share 
a Voronoi edge, and this edge is at least partially con- 
tained in the body of the molecule. In another word, 
atoms have to be in physical nearest neighbor con- 
tact. In this study, we only examine the 1-simplices, 
or edges in the dual simplicial complex, which rep- 
resents the pairwise contacts. This description is re- 
lated to the work from Wodak's group, where the con- 
tact area between atoms arc calculated as the area of 
intersection of the accessible atom ball around each 
atom and the faces of its weighted Voronoi cell [15]. 

We develop a statistical contact propensities based 
on alpha edge simplices and a combinatorial null 
model. To account for the uncertainty of estimated 
propensity parameter, we develop a studentized boot- 
strap method for estimating their 95% confidence in- 
tervals. We also examine how geometric represen- 
tation of dual simplicial complex influence the effec- 
tiveness of pairwise contact potential functions. In 
addition, we aim to understand how the size of alpha- 
bet of amino acid residues affect the effectiveness of 
empirical pair potentials. This is important for pro- 
tein design, where any reduction of the alphabet size 
of residues will result in exponentially more efScient 
sampling in the sequence space, therefore leading to 
more successful design strategies [16, 17]. 

This work is also motivated by the need to develop 
potential functions that takes advantage of recent de- 
velopment in computational geometry and computa- 
tional topology. The alpha shape representation of 
proteins from computational geometry allow rapid 
calculations of metric, topological, and combinato- 
rial structures of proteins precisely [14, 18, 19]. These 
advantages can lead to improvements in voids and 
binding site detection [18-20], in hierarchical repre- 
sentation of protein dynamic shapes at different res- 
olution, and in conformation sampling. Recent the- 
oretical development suggests many intriguing appli- 
cations in protein studies [21-23]. These important 
advances are largely unexploited currently, and we 
hope this work provides a useful link by developing 
empirical pairwise contact potentials based on dual 
simplicial complex representations of proteins. 

Our approach also solves a problem that cannot 
be satisfactorily addressed with current contact pair 
potentials. In current approaches, pairwise contact 
interactions arc declared if two residues arc within 
a specific cut-off distance. Potentials based on this 
contact definition have achieved considerable success. 



Nevertheless, contacts by distance cut-off can poten- 
tially include many implausible non-contacting neigh- 
bors, which have no significant physical interaction 
[24]. Whether or not a pair of residues can make 
physical contact depends not only on the distance 
between their center positions (such as or C^, 
or geometric centers of side chain), but also on the 
size and the orientations of side-chains [24] . Further- 
more, two atoms close to each other may in fact be 
shielded from contact by other atoms. As emphasized 
in [25], these contact pairs should not contribute to 
the assessment of pairwise contacts. By occupying 
the intervening space, other residues can block a pair 
of residues from interacting with each other. Inclu- 
sion of these fictitious contact interactions would be 
undesirable. 

We organized this paper as follows. First, we de- 
scribe briefly the dual simplicial complex represen- 
tation of proteins structures. We then discuss the 
probabilistic models for developing pairwise poten- 
tials. A bootstrap resampling procedure is then intro- 
duced which provide confldence intervals of estimated 
pairwise potential parameters. We then present the 
pairwise contact potential along with experimental 
results in discriminating native structure against de- 
coy structures using several benchmark data sets. We 
further examine how pair potentials developed from 
the dual simplices compare with cut-off contact def- 
initions using side-chain centers. The effects of re- 
duced alphabet size for amino acid residues are then 
described. We conclude with remarks and discussion. 



2 Model and Methods 

Alpha Contacts from Dual Simplicial Com- 
plex. Alpha shape has been successfully applied to 
study a number of problems in proteins, including 
void measurement, binding site characterization, pro- 
tein packing, electrostatic calculations, and protein 
hydrations [14,18,20,26-30]. Details of alpha shape 
have been described elsewhere, here we only provide 
a brief description for completeness. 

To illustrate, Figure la shows a two-dimensional 
molecule formed by a collection of disks of uniform 
size. Each Voronoi cell is deflned by its boundaries, 
shown as broken lines. Every Voronoi edge is a per- 
pendicular bisector of the line between two atom cen- 
ters. Each Voronoi cell contains one atom, and ev- 
ery point inside a Voronoi cell is closer to this atom 
than to any other atom. Three connected Voronoi 
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Figure 1: Geometric constructs of a simple 2D molecule, a) 
The molecule is formed by disks of uniform size. The dashed 
lines represent the Voronoi diagram, where each region con- 
tains one atom, b) The convex hull of the atom centers, and 
the Delaunay triangulation of the convex hull, c) The alpha 
shape of the 2D molecule is a subset of the Delaunay tri- 
angulation. It is contained within the molecule, and reflects 
the topological and metric properties of the molecule. 



edges meet at a Voronoi vertex. Another geomet- 
ric construct, the Delaunay triangulation (Fig lb) is 
mathematically dual to the Voronoi diagram, and can 
be explained by the following procedure. For each 
Voronoi edge, connect the corresponding two atom 
centers with a line segment, and for each Voronoi 
vertex, place a triangle spanning the three atom cen- 
ters of the three Voronoi cells. Completing this for 
all Voronoi edges and Voronoi vertices gives a collec- 
tion of line segments and triangles. Together with 
the vertices representing atom centers, they form the 
"Delaunay complex" , which is the underlying struc- 
ture of Delaunay triangulation. 

Now we remove all Delaunay edges (or line seg- 
ments) where corresponding Voronoi edge of the two 
atoms does not intersect with the molecular (Fig- 
ure Ic)). When two atoms are spatially very close, the 
balls representing the two atoms intersect, and these 
two atoms have non-zero, two-body volume overlap. 
When three atoms are spatially very close, they inter- 
sect and have non-zero, three-body volume overlap. 
We further remove all Delaunay triangles where the 
corresponding Voronoi vertex of the three atoms is 
not contained within the molecule. The subset of the 
Delaunay complex formed by the remaining triangles, 
edges and vertices (atom centers) is called the dual 
simplicial complex, or the alpha complex. We are in- 
terested in identifying only contacting atoms that are 
spatial nearest neighbors. These are precisely atoms 
with two-body volume overlaps whose Voronoi cells 
intersect. By following the mathematical dual struc- 
ture, i.e., the edges in the alpha complex, we can 
accurately identify all contacting nearest neighboring 
atom pairs. For convenience, we use a rather arbi- 
trary criterion and declare two residues are in alpha 



contact if there is at least one edge connecting these 
two residues. 

Using the alpha shape API kindly provided by 
Prof. Edelsbrunner, a program interface has been 
implemented to compute contacting atoms based 
on precomputed Delaunay triangulation and al- 
pha shape. The Delaunay triangulation is com- 
puted using the delcx program, and the alpha 
shapes computed using the MKALF program. Both 
can be downloaded from the website at NCSA 
(http://www.ncsa.uiuc.edu). The van der Waals 
radii of protein atoms are taken from [31]. We follow 
Singh & Thornton and increment the van der Waals 
radii by 0.5 A[32]. This increment is small and com- 
parable with the resolution of the structure. It en- 
ables the modeling of imprecisely determined atomic 
coordinates without introducing many spurious two- 
body contacts. 

Probabilistic Model for Pairwise Alpha Con- 
tact Propensity. The propensity P{i,j) for 
residue of type i interacting with residue of type j 
is modeled as the odds ratio of the observed proba- 
bility q{i,j) and the expected probability p{i,j) of a 
pairwise alpha contact involving both residue i and 
h ■ 



P{h]) 



(1) 



P{hj) 

To compute p{i,j) and q(i, j), we choose a simple null 
model, where the observed contacts of different pro- 
teins in the entire database are pooled together, and 
the expected contact numbers are calculated. This 
is the same as the reference state of composition- 
independent scale discussed in literature [33]. We 
have: 

Here, a{i,j) is the number count of atomic con- 
tacts between residue type i and residue type j, and 
Si' j' '^(^'i j') is the total number of all atomic con- 
tacts. The observed probability is then compared 
against the random probability p(i,j) that a pair of 
contacting atoms is picked from a residue of type i 
and a residue of type j, when chosen randomly and 
independently [34]. We have: 

p{i,])^N,Ny{ J^'"^' + Z"'"^" ), when i^j 
n[n — rii) n[n — nj) 



and 



pii,j) = N,N, 



i-l ■ 



n{n — Hi) 



when i — j 
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where Ni is the number of interacting residues of type 
i, rij is the number of atoms residue of type i has, and 
n is the total number of interacting atoms. 

The alpha contact potential between residue i and 
residue j is obtained from the propensity value P{i,j) 
as \nP{i,j), and the overall energy of a protein is 
calculated as: 

e=-^lnP(i,i); 

Cys-Cys Interactions. In principle, only 210 pa- 
rameters are necessary for 20 amino acid residues. 
However, Cys-Cys contact requires special attention. 
Its propensity value is the largest c;onipared to the 
other 209 contacts, because Cys residue tends to form 
disulfide bond with another Cys residue. Neverthe- 
less, there are many Cys-Cys residue pairs that are 
in close spatial proximity but do not form disulfide 
bond. As a result, a mis-classification of a non- 
disulfide Cys-Cys contact as a disulfide bond will af- 
fect the overall score considerably, especially for small 
proteins with abundant Cys residues. The problem 
associated with the mis-classification of Cys-Cys con- 
tact was already pointed out in [10]. 

To avoid assigning the same score to the two differ- 
ent types of Cys-Cys contacts, we introduce a slightly 
more detailed propensity scores for Cys-Cys interac- 
tions. Since contacts between C:0, C:N, C:C, and 
0:0 atoms never appear in disulfide bonds, a Cys- 
Cys contact pair lacking these atomic interactions is 
classified as a disulfide bond Cys-Cys contact if in 
addition the distance between their SG atoms is less 
than 2.5 A. All other Cys-Cys interactions arc classi- 
fied as nonbonded Cys-Cys contact. The propensity 
values estimated for these two types of Cys-Cys con- 
tacts are listed in Table II. 

Bootstrap Resampling. Because the sample size 
of 1,045 proteins in PDBSELECT is limited, statistical 

modeling with approximations may be prone to er- 
rors. It is therefore essential to assess reliability of es- 
timated contact potentials. Here we apply bootstrap 
techniques to calculate confidence intervals from sim- 
ulated data sets [35,36]. For alpha contact potential 
of a specific residue pair, {e.g., Trp-Trp), we denote 
the true value of the contact potential as 6. Our 
probabilistic model (Equation 1) can be regarded as 
an estimator T that gives the estimated value t from 
the finite amount of data for 6. Our goal is to calcu- 
late a 95% confidence interval for 9. 



We resample 1,045 proteins independently R times 
from the set of 1,045 proteins from Pdbselect , with 
replacement allowed. We have a simulated data set of 
Y{,..,Y^, each contains 1,045 proteins. Some struc- 
tures in the original Pdbselect set appear multiple 
times, some appear once, and some never appear. We 
estimate the pair contact value 9 from each of the R 
samples, and obtain For an cquitailcd 95% 

confidence interval (95% = 1 — 2a, a = 2.5%), we 
have the basic bootstrap confidence limits: 

(^(K+l)(l-a)' t(R+l)a) 

The accuracy of these limits depend on R, and how 
well the distribution of {t* — t} agrees with that of 
T — 9. Perfect agreement occurs only when the dis- 
tribution oiT — 9 does not depend on any unknown 
variables. 

To reduce possible errors due to unknown vari- 
ables, we use studentized bootstrap. For the rth boot- 
strapped sample, we calculate: 

* _ t* -t 

r *i/2 

Vr 

To obtain v*, we bootstrap with replacement again 
M times the rth sample of the original bootstrap. We 
have: 

1 ^ 

m— 1 

where t\,...t*j^ are calculated from the second boot- 
strap sampling. We then use the (i? -t- 1) • ath order 
statistic of the simulated values zl, z'^, or z^J^_^_-^^^ 
to obtain the studentized bootstrap confidence inter- 
val for 9: 

{t - ' t - ^^^^^(fl+l)a) 

Since M bootstrap samples from the rth sample 
are needed to obtain v*, the total number of nested 
bootstrap samples is {M+l)-R. We chose R= 1, 000 
and M = 50. Altogether, we generate 1,001 x 50 = 
50, 050 bootstrap samples to calculate the confidence 
intervals for each of the 209-1-2 pairwise alpha contact 
propensities. 

Database Selection. In this study, we use 
Pdbselect from http: //www. cmbi.kun.nl 

/swift/pdbsel [37,38]. Pdbselect contains 
1,045 proteins selected from the Protein Data Bank 
(PDB). The sequence identity between any pair of 
proteins in Pdbselect is smaller than 25%. 
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Table I: The alpha contact potential of pairwise interactions of amino acid residues. The upper triangle of the table lists 
the propensity values, the lower triangle lists the 95% confidence intervals. The 95% confidence intervals for the diagonal 
entries are listed separately. The potential values for the two different types of Cys-Cys contacts are also listed. 
" The alpha contact potential of Cys-Cys if all Cys-Cys conformations are classified as one type 

95% confidence interval of alpha contact potentials between self-pair of amino acid residues. 
" -^non-disulfide-bond potential of Cys-Cys without a disulfide bond; ^disulfide-bond potential of Cys-Cys with 

a disulfide bond. 



3 Results 

Pairwise Alpha Contact Potentials. The pair- 
wise alpha contact propensities are listed in Table II. 

These propensities arc calculated for all residue con- 
tacts that are at least 3 residues away in primary 



sequence. As expected, Cys-Cys has the highest 
propensities for contact interactions. Other residue 
pairs with the highest propensities for contact inter- 
actions {P{i,j) = 1.4 — 2.5) are pairs of hydropho- 
bic residues {e.g., Met-Met, Ala-Ala, Ile-Ile, Phe- 
Phe, Ile-Leu, and lie-Met). The group of residue 
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pairs with the second highest propensities {P{i,j) = 
1.2 — 1.3) are ionizable residues with opposite charges 
{e.g., Arg-Asp, Arg-Glu, Asp-Lys, and Gly-Lys). 
Residue pairs with lowest alpha contact propensi- 
ties {P{i,j) = 0.4 — 0.6) are dominated by pairs 
of ionizable residues of the same charge (e.^.,Arg- 
Lys, Asp-Glu, Lys-Lys, and Glu-Glu). The group 
of residue pairs with the second lowest alpha contact 
propensities {P{i,j) = 0.5 — 0.7) are between ion- 
izable residues and hydrophobic residues {e.g., Asp- 
Phe, Asp- He, Asp-Lcu, and Glu-Val). Noticeably, 
pairs of Pro and ionizable/polar residues also have 
very low propensity for contacting interactions, prob- 
ably due to the lack of a backbone- NH for H-bonding 
interactions. 

The confidence intervals of these propensity val- 
ues given by the studentized bootstrap procedure in- 
dicate that most of them are estimated accurately. 
Among the 209 parameters excluding Cys-Cys inter- 
action, the 95% confidence intervals for 153 contact 
propensities are < 0.2, a very tight interval. The con- 
fidence intervals of 36 contact propensities are < 0.3. 
Contact propensities with the largest confidence in- 
tervals around 0.6 are Trp-Trp, Met-Met, Cys-Met, 
and Cys-Trp. 

Correlating and Clustering Similar Amino 
Acid Residues. The overall behavior of pairwise 
contact interactions for a specific residue type is de- 
termined by its 20 pairwise contact propensity val- 
ues. These values represent a profile of contact in- 
teractions specific for the residue type, and can be 
represented as a 20-dimensional vector x. 

We group the 20 types of amino acid residues by the 
Euclidean distance between the 20 vectors [34]. Fig 2 
shows the grouping of the 20 amino acid residues ob- 
tained by hierarchical clustering. As an exploratory 
tool for data analysis, hierarchical clustering can dis- 
cover interesting and informative grouping patterns 
in data [39]. In Fig 2, residues that have close contact 
propensity values to the 20 residue types are grouped 
together. 

The pattern of groupings of residues clearly re- 
flect the physical and geometric characteristics of the 
amino acid residues. Cys residue is different from 
all other residues, because of its propensity to form 
disulfide bond. The rest of the 19 residues can be 
broadly divided into two well defined branches of 
hydrophobic and hydrophilic residues. Among the 
hydrophilic residues, ionizable residues with positive 
charge and negative charge are grouped into two in- 
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Figure 2: Grouping of the 20 types of amino acid residues 
by hierarchical clustering with complete linkage of the Eu- 
clidian distance between their 20-dimensional propensity vec- 
tors. 



dividual small clusters. Hydroxyl-containing residues 
(Ser and Thr) and amide residues (Asn and Gin) 
are also clustered into two individual small clusters. 
These residues are all capable of forming side chain 
hydrogen bonds, and their clusters are neighboring 
with each other, forming a larger cluster of Ser, Thr, 
Asn, and Gin. Among the hydrophobic residues, 
the branched residues (Val, lie, and Leu), and small 
residues (Ala and Gly) are grouped into their own 
clusters. Aromatic residues Trp and Tyr also group 
into one cluster. Phe has strong hydrophobicity and 
is group with other strongly hydrophobic amino acid 
residues, rather than clustering with the other two 
aromatic residues. Pro and His are grouped together, 
probably because both do not form strong favor- 
able contacts with either hydrophilic or hydrophobic 
residues. 

The clustering pattern of residues by alpha contact 
propensity also resembles to a certain extent the clus- 
tering pattern derived from mutation matrix BLO- 
SUM50 [40] as reported by Murphy et al [41]. For 
example, Arg with Lys, Asn with Gin, Ser with Thr, 
and Glu with Asp are all clustered tightly with each 
other. In addition to the distinct grouping of Cys in 
our clustering result, a notable difference is that by 
alpha contact propensity residues Ser, Thr and Pro 
are grouped with hydrophobic residues instead of hy- 
drophilic residues. 



6 





Misfold 


ifu 


Asilomar 


pdberr 

and spga 


RAPDF 


25/25 


32/44 


39/42 


5/5 


CDF 


19/25 


20/44 


36/42 


5/5 


GC 


25/25 


21/44 


32/42 


4/5 


MJ 


25/25 


21/44 


34/42 


5/5 


BT 


25/25 


22/44 


35/42 


4/5 


Alpha 


25/25 


24/44 


37/42 


4/5 



Table II: Discriminating native structure from decoys in 
ProStar data sets using alpha contact potential (Alpha). 
Results for each decoy subset are compared with those of 
other potentials, including RAPDF (Residue-specific All- 
atom Conditional Probability Discriminatory Function) [6], 
CDF (Contact Discriminatory Function) [6], MJ (Miyazawa 
& Jernigan Potential) [5], BT (Betancourt & Thirumalai 
Potential) [55] and GC (Geometry Center Based Potential). 
The first number in each cell is the number of correctly 
identified native proteins, and the second number is the to- 
tal number of proteins in the subset. The first row lists the 
names of the decoy subsets. Data for RAPDF, and CDF are 
taken from Figure lb of reference [6]. 

Discriminating Native Structures from De- 
coys. An important method to determine the effec- 
tiveness of contact potential functions is to evaluate 

its success and failure in distinguish native protein 
structures from incorrectly folded decoy structures. 
We use three decoy data sets to assess alpha contact 
potential. 

ProStar. The decoy sets in ProStar database [6] 
contain several subsets. The misfold subset contains 
conformations of 25 sequences which are obtained by 
placing these sequences on structures of different folds 
with the same number of residues. The conformations 
of the side chains are obtained by Monte Carlo sam- 
pling [42] . Alpha contact potential succeeded in iden- 
tifying all 25 native structures correctly (Table IIII). 

For the Asilomar subset, alpha contact potential 
failed to identify 5 native structures out of 42 pro- 
teins. For the IFU subset, alpha contact potential 
failed for 20 out of 44 native structures. Alpha con- 
tact potential belongs to the class of residue-based 
potentials, similar to MJ, BT, CDF [6]. As pointed 
out in [7], decoys in IFU subset are especially chal- 
lenging for residue-based potentials, because they are 
conformations of short loops, and have only a small 
number of residue contact interactions. 

In the subset pdberr and sgpa, the decoys are 

structures determined by diffraction but contain se- 
rious error, or are structures generated by molecu- 




RMSD 



Figure 3: Energy evaluated by alpha contact potential plot- 
ted against the RMSD to native structures for conformations 
in Park& Levitt Decoy Set. The alphabet of residues has 20 
types of amino acids. For vitamin D-dependent calcium- 
binding protein (3icb, a structure with better resolution 
(4icb) has the lowest energy (denoted by "+"). 

lar dynamics simulations starting from experimental 
conformations. Alpha contact potential missed one 
out of the five native structures. 

Park & Levitt Set. This decoy test set contains 
native and near-native conformations of seven se- 
quences, along with about 650 misfolded structures 
for each sequence. The positions of Ca of these de- 
coys were generated by exhaustively enumerating ten 
selectively chosen residues in each protein using a 4- 
state off-lattice model. All other residues were as- 
signed the phi/psi value based on the best fit of a 
4-state model to the native chain. Conformations 
in the decoy sets all have low score by a variety of 
scoring functions, and have low RMSD to the native 
structure (Table IIIIII) [43]. 

The results of discrimination test are listed in Ta- 
ble IVIV and plotted in Fig 3. For five of the seven 
proteins, the native structures have lowest energy by 
alpha contact potential. For protein 3icb and 4rxn, 
the native structures have the 5th and 51st lowest 
energy values, respectively. For all proteins, decoys 
with the lowest energy are within 2.5 A RMSD to the 
native structure. 

The protein 3icb is a vitamin D-dependent 
calcium-binding protein. Although the energy of the 
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Decoy bet 


Protein 


Description 




A^ decoy 


cRMSD range 


4state_ 


Ictf 


C-tcrmmal domain oi tiic nbosomal protcm L7/L12 


68 


630 


2.16 - 


10.16 


reduced 


lr69 


JN -terminal domain or phage 434 repressor 


63 


675 


2.28 - 


9.50 




lsn3 


Scorpion toxin variant 3 


65 


660 


2.50 - 


10.46 




2cro 


phage 434 Gro protein 


65 


674 


2.05 - 


9.72 




Jicb 


Vitamin D-dependent calcium-binding protein 


75 


653 


1.81 - 


10.74 




4pti 


trypsin inhibitor 


58 


687 


2.83 - 


10.79 




4rxn 


rubrcdoxin 


54 


677 


2.58 - 


9.28 


lattice. 


ibeo 


f3 -Cryptogein 


no 


onnn 
zUOO 


7.00 - 


15.61 


ssfit 


Ictf 


(see above) 


68 


2000 


5.45 - 


12.81 




Idkt-A 


Human Cyclm Dependent Kinase bubunit, lype 1 


72 


2000 


6.69 - 


14.05 




lica 


Ferredoxin 


55 


2000 


5.14 - 


11.39 




Inkl 


Nk-Lysin 


78 


2000 


5.27- 


13.64 




Ipgb 


Bl immunoglobulin-binding domain of 
streptococcal protein G 


56 


2000 


5.81 - 


12.91 




Itrl-A 


NMR solution structure of the G-terminal fragment 
255-316 of thermolysin 


62 


2000 


5.38 - 


12.52 




4lCD 


Galcium-Binding Protein 


10 


2000 


4.74 - 


12.92 


— j — 

Imcls 


1 Dull- 13 


„■ ■ — 7- — 

oini protein subunit 


oy 




2.45 - 


6.03 




Ibba 


Pancreatic hormone (AVE. NMR) 


36 


500 


2.78 - 


8.91 




Ictf 


(see above) 


68 


497 


3.59 - 


12.53 




Idtk 


Dendrotoxin K (NMR) 


57 


215 


4.32 - 


12.58 




Ifc2-C 


Fragment B of protein A (Gomplexed to 
immunoglobin Fc) 


43 


500 


4.00 - 


8.45 




ligd 


3rd IgG-binding domain from streptococcal protein G 


61 


500 


3.11 - 


12.56 




Ishf-A 


Fyn Proto-Oncogene Tyrosine 
Kinase subunit (SH3 domain) 


59 


437 


4.39 - 


12.35 




2cro 


(see above) 


65 


500 


3.87- 


13.48 




2ovo 


3rd domain of silver pheasant ovomucoid 


56 


347 


4.38 - 


13.38 




4pti 


(see above) 


58 


343 


4.94 - 


13.18 



Table III: Description of proteins in tlie 4-state-reduced decoy set, Lattice-ssfit decoy set and Lmds decoy set. 
The number of decoy structures and the cRMSD ranges are listed. 



native structure ranks as the 5th lowest energy, the 
RMSDs of the first four lowest energy decoys are all 
within 2.0 A RMSD to the native structure. For a 
higher resolution structure (4icb at 1.60 A) of this 
protein, the energy of 4icb by alpha contact poten- 
tial is lower than any of the decoys. It is possible 
that this mis-classification might be due to the lower 
resolution of structure of 3icb. 

Rubredoxin (4rxn) is an iron-sulfur protein. A 
Fe(III) ion is covalently bound in this structure with 
four Cys sulfur atoms, preventing these four Cys from 
forming two possible disulfide bonds. Because the 
protein description and the contact potential do not 
contain any information about the important cova- 
lent bonds of Cys with Fe-S cluster, it is reasonable 
to expect that native structure will not be of lowest 



energy if these important covalent bond interactions 
are unaccounted for. It is possible that the struc- 
ture of rubredoxin might be different without the Fe- 
S cluster. The decoys at the lowest energy states 
form one or two fictitious disulfide bridges, and all 
are near-native structures with RMSD around 2 A to 
the native structure. MJ and BT potential work bet- 
ter on 4rxn because they classify the four Cys-Cys 
contacts as disulfide bonds. 

Latticessfit Set. The lattice_ssfit set contains 
conformations for eight small proteins generated by 
ab initio protein structure prediction methods [44, 
45]. The conformational space of a sequence was ex- 
haustively enumerated on a tetrahedral lattice. A 
lattice-based scoring function was used to select the 
10,000 best scoring conformations. Secondary struc- 



8 

















rlgc 




Hmj 


Hbt 




JT LJID 




Zj 


1 


/ / 1 , uuu 


iXdlifi. 


Zj 




Zj 


Rank 


Z 




IClI 


1 
1 


O.Uo 


l.U 


1,UUU 


1 
1 


^ AO 
0.4z 


1 
1 


0. ( 


1 


3.86 


redu.c6(i 


iroy 


1 

1 


0.00 


1 n 

l.U 


1 nnn 

IjUUU 


Q 



9 

Z.04 


1 
1 


"4:. 1 ± 


1 


4.47 




isno 


1 
i 


o.lU 


1 n 
l.U 


1 nnn 
1,UUU 


Q 


AG 

z.4y 




Z 


Q 1 7 
0.1 i 


6 


2.97 




2cro 


1 
i 


o.uu 


1 n 

l.U 


1 nnn 

1,UUU 




9 Q1 

z.y 1 


1 
1 


/I 9Q 

4.zy 


1 


3.92 




3icb 


5 


2.19 


3.4 


52 


10 


2.14 


2 


2.80 


1 


2.83 






1 

J. 




1 

± .U 


1 000 

±,UUU 


1 1 


9 98 
z.zo 


Q 



0. ±u 


5 


2.65 




4rxn 




1 99 


n 

^o.u 




u 


1 

J. 


9 7^ 
z. 1 


1 

1. 


^ OQ 
o.uy 


1 


3.01 


— : 

lattice. 


Ibeo 


1 
i 


/I 7/1 


1 1 
1.1 


yzo 


z 


o.Dy 


1 
1 


/I 7/1 
4. / 4 


1 


7.29 


ssfit 




1 
1 


/I fi9 
4.DZ 


1 n 

l.U 


1 nnn 

1,UUU 


1 
1 


^ OQ 

o.uy 


1 
1 


o.oo 


1 


6.99 






1 

J. 




1 n 

l.U 


1 nnn 

1,UUU 


10 


9 

Z.OO 


^9 
oZ 


9 A1 

Z.4:l 


5 


3.49 




ilea 


/in 


9 ni 

z.Ui 


Q9 n 

oZ.U 


u 


Z04 


l.lo 


c 



Q /in 

0.4U 


2 


3.92'q 




1 nlrl 

inKi 


1 

i 


91 


l.U 


1 nnn 

1,UUU 


1 


7 9n 

/ .zU 


1 
1 


nQ 
o.uy 


1 


7.28 




Ipgb 


1 
i 


O.Ol 


1 n 

l.U 


Qfi/1 

yD4 


Q9 
OZ 


9 1 S 
z. lo 


Q 




Q 7S 


2 


3.82 




A 

ITl 





■)..>■) 




n 
U 


t)U4 


u.o.> 


A 
4 


9 01 

z.yi 


2 


3.82 




iicl) 


1 


I..")!) 


1.0 


1. ()()() 
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IbOn-B 


2 


3.13 


1.5 


525 


99 


0.85 


1 


2.65 


2 


2.50 




Ibba 


217 


0.03 


340.8 





441 


-1.11 


364 


-0.64 


234 


0.04 




Ictf 


1 


3.12 


1.0 


1,000 


74 


1.09 


1 


3.86 


1 


3.15 




Idtk 


2 


2.13 


2.2 


234 


173 


-0.92 


13 


1.71 


122 


-0.08 




Ifc2-C 


500 


-3.68 


500 





480 


-1.63 


501 


-6.24 


501 


-5.11 




ligd 


9 


2.43 


14.0 





138 


0.61 


1 


3.25 


1 


3.76 




Ishf-A 


17 


1.46 


8.2 





322 


-0.57 


11 


1.30 


16 


1.06 




2cro 


1 


4.36 


1.0 


999 


159 


0.44 


1 


5.07 


1 


4.01 




2ovo 


3 


3.07 


5.2 


29 


326 


-1.34 


2 


3.25 


31 


1.29 




4pti 


9 


2.23 


7.0 





242 


-0.49 


4 


2.53 


117 


0.42 



Table IV: Discriminating native structures using alpha contact potential Ha and potential by cut-off distance between 
geometric centers of side chains Hgc. The 4-state-reduced, Lattice-ssfit, and Lmds decoy sets are tested. 
" Rank of native structures. 

^ z = E — Enative/cF] E and a are the mean and standard deviation of the energy values of conformations, respectively. 
" Average ranking of native structures in energy evaluated using 1,000 bootstrapped potential values; 
The number of times of a native structure is ranked to have the lowest energy. 



tures were fitted to these conformations using a 4- 
state model [43]. The 10,000 conformations were fur- 
ther scored using a combination of an all-atom scor- 
ing function [6] , a hydrophobic compactness function, 
and a one-point per residue scoring function [46] . The 
2,000 best scoring conformations for each protein are 
selected as decoys for this data set. 

The result (Table IVIV) shows that for six out of 
eight proteins, no decoy structures score better than 
the native structure. The exceptions are If ca and 
Itrl. Similar to 4rxn in the Park & Levitt decoy 
set, ferrodoxin If ca contains a Fe-S cluster. Its four 
Cys residues form four covalent bond with the four 
Fe(III) irons, instead of two disulfide bridges. These 
critical contacts again arc unaccounted for in the al- 
pha contact potential, and therefore the native struc- 
ture of this protein was not identified successfully. 



ItrlA is a NMR solution structure of the C-terminal 
fragment (255-316) of thermolysin. NMR structures 
arc far more difficult to recognize, as discussed in 
detail in [10]. They are usually represented as an 
ensemble of conformations. The contact energies of 
conformations in the ensemble can be substantially 
different. It is conceivable that an energy function 
valid for crystal structures cannot reliably recognize 
native NMR structures [10]. In addition, the struc- 
ture ItrlA occurs in a dimeric state in the original 
PDB file. There is substantial interaction between 
the two chains. Because of this, it is unclear whether 
single subunit of ItrlA in monomeric state would re- 
tain the same conformation. 

The decoy structures in this data set are generated 
by ab initio methods. None of them are near-native, 
and all have cRMSD to the native structure greater 
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than 4.7 A (Table IVIII). When decoys are so differ- 
ent from the native conformation, energy evaluated 
using alpha contact potential shows little correlation 
with the RMSD. The lowest energy decoys in this 
data set all have large RMSD, similar to results re- 
ported in [47] (data not shown). 

Lmds Set. The local minima decoy set (lmds) con- 
tains decoys that were derived from the experimen- 
tally obtained secondary structures of ten small pro- 
teins that belong to diverse structural classes. Each 
decoy is a local minimum of a "hand made" energy 
fimction [48 51]. Ten thousand initial conformations 
were generated for each protein by randomizing the 
torsion angles of the loop regions [52]. The adja- 
cent local minima were found by truncated Ncwton- 
Raphson minimization in torsion space. Each protein 
is represented in the decoy set by its 500 lowest en- 
ergy local minima. 

The alpha contact energy function works fairly well 
in the recognition of IbOn-B, Ictf, Idtk, 2cro 
and 2ovo. Idtk (Dendrotoxin K) contains throe 
disulfide bonds in its native structure. However, in 
most of its 215 decoys, six Cys residues arc spa- 
tially arranged together and form on average seven 
Cys-Cys contacts. For some decoys, up to 15 Cys- 
Cys contacts by distance cut-off can be found. The 
ability of discriminating disulfide bonded versus non- 
disulfide bonded Cys-Cys contacts probably makes 
the alpha contact potential discriminate better than 
the MJ and the BT potentials. The native structure 
of ligd (IgG-binding domain from streptococcal pro- 
tein G) ranks first by the MJ and the BT potential, 
but ranks 9th by alpha contact potential. The recog- 
nition of Ibba and Ifc2-C in this set failed for all 
residue-based contact potentials. Ibba is an atypi- 
cal structure of a small protein determined by NMR, 
which forms a helix with random coil. If c2-C is a 
fragment of protein complexed with an immunoglob- 
ulin molecule. It possible that this protein may not 
maintain the same conformation without the com- 
plexed immunoglobulin. 

By the criterion of the ranking of native protein, 
with the exception of the ion-sulfur proteins 4rxn and 
If ca, the overall results shown in Table IVII and Ta- 
ble IVIV indicates that the performance of alpha con- 
tact potential in discriminating native protein from 
decoys is better than that of MJ and BT potentials 

for the MISFOLD, IFU, ASILOMAR, 4STATE_REDUCED 

sets and the lattice_SSFIT set, and has comparable 
results for the LMDS set, the pdberr set, and the 
SPGA set. 



4 Discussion 

Contact Definition. The alpha contacts intro- 
duced in this work are different from contacts by cut- 
off distances. Atoms in alpha contacts are all within 
a distance that depends on the identities of the two 
atoms. In this study, this distance is equal to the sum 
of the van der Waals radii of the two atoms, plus 
2 X 0.5 A. Unlike contacts by distance cut-off, this 
distance is not a single fixed constant but depends 
on the atom types. Another important distinction of 
alpha contact is that only a subset of atoms satisfy- 
ing the distance criterion will be counted as physi- 
cal nearest neighbors, because we have an additional 
criterion: contacting atoms must have intersecting 
Voronoi cells. Alpha contacts represents the geom- 
etry more accurately and can capture contact inter- 
actions due to side chain size and orientation [25]. 
In addition, no fictitious contacts will be introduced 
between two atoms when there is a third interven- 
ing atom [24]. Perhaps this is the reason why alpha 
contact potential is sensitive to the presence of Fe-S 
clusters and other hetero atoms, which can be po- 
tentially exploited for determining whether a protein 
structure should contain hetero atoms. 

For the 1,045 proteins in the PDBSELECT dataset, 
we compare contacts identified by distance cut-off 
with the threshold of two van der Waals atom radii 
plus 2 X 0.5 A and contacts identified by the alpha 
shape. We found that about 30 - 50 % of atom con- 
tacts detected by distance cutoffs are blocked by a 
third atom and hence do not have physical interac- 
tions. As a result, 3 - 6% residue contacts detected by 
distance cutoffs do not interact physically. Inclusion 
of these fictitious contact is problematic, especially in 
developing all-atom contact potentials, as well as in 
future studies when higher order interactions in the 
form of three or four body contacts are incorporated. 

Evaluating Discrimination of Alpha Contact 

Propensities by Bootstrap. How robust is the 

results of decoy discrimination to the specific values 
of alpha contact potentials and the specific choices 
of the structures in the database? We further make 
use of the bootstrap resampling technique to evaluate 
the reliability of the discrimination results. As dis- 
cussed earlier, we rcsamplcd 1,045 proteins in Pdbs- 
ELECT independently R = 1,000 times with replace- 
ment allowed, and obtain 1,000 contact propensity 
matrices. Each is then used to discriminate the de- 
coys in Table III. We use two parameters: f , the aver- 
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age ranking of the native structure, and /, the times 
a native structure ranked as the structure with the 
lowest energy. Table IV shows that for many decoy 
sets {e.g., Ictf , lr69, lsn3, 2cro, and 4pti in the 
4STATE_REDUCED Set), the native structure always 
ranks first in the 1,000 bootstrapped energy evalu- 
ations. The performance using 1,000 different sets 
of "bootstrapped" potential values validates the ro- 
bustness of the method deriving the alpha contact 
potential and the informativeness of the underlying 
protein structure database. 



Comparison with Contact by Geometric Cen- 
ters. Contact definition by distance cut-off is 
widely used in the development of many potential 

functions. Here we compare potentials obtained by 
alpha contact, denoted as Ha, and by contact defined 
by cut-off distance between geometric centers of side 
chains, denoted as Hgc. Following reference [53], two 
residues are declared to be in contact if the distance 
d between the geometric centers of their side chains 
is: 2A< d < 6.5A. Geometric center based contact 
potential Hgc are developed using the same Pdbse- 
LECT data, with the same null model as that of alpha 
contact potential, and similarly counting only con- 
tact between residues that are three or more residues 
apart. Therefore, any difference between Ha and Hgc 
is solely due to different geometric representation. 

The log values of the parameters of Ha and Hgc 
have an overall correlation coefficient of p = 0.77. 
However, the contact maps of individual proteins by 
these two different contact definition are often sub- 
stantially different. As an example, alpha shape dual 
simplicial complex gives significantly more contacts 
than cut-off distance by geometric centers of side 
chain for protein labe (Fig 4). 

Fig 5 illustrates why such a discrepancy exists be- 
tween these two contact models (sec Figiirc 5 legend). 
The strong correlation between Ha and Hgc is decep- 
tive and this is reflected in another aspect. Although 
the correlation coefficient is high, the pairwise contact 
potentials may have very different values for Ha and 
Hgc. Hgc categorically gives much higher propensity 
values for interactions between small residues. For 
example, Hgc for Gly-Gly and Ala-Gly are 4.55 and 
3.04, respectively, but Ha are only 1.48 and 1.39, 
respectively. Gly-Pro interaction is strongly favor- 
able by Hgc {P{Gly,Pro) = 1.84), but is unfavorable 
by Ha (0.87). On the other hand, Hgc gives much 
lower propensity values for interactions between large 
residues. For example, Hgc for Trp-Trp and Phe-Trp 
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Figure 4: Difference in contact histograms between the two 
contact definitions of alpha contact and contact by geometric 
centers for protein structure labe. Data along x-axis is arranged 
in ascending order by the frequency of contact pairs in the alpha 
shape contact model. There are frequently significant less con- 
tacts when contact defined by distance between geometric centers 
is used. 



are 0.39 and 0.56, respectively, but Ha are 1.75 and 
1.65, respectively. In addition, many pair contact 
interactions between Trp and another residue with 
large side chains (such as Arg, Tyr, Phe, His, Leu, 
He, and Met) are unfavorable (< 1.0) by Hgc, but are 
favorable by Ha- 

These differences lead to different discrimination 
in identifying native and near-native protein struc- 
tures from decoy structures. Because it is impossi- 
ble to define refined potential for Cys-Cys contacts 
in Hgc model as in alpha contact model, we exclude 
proteins containing Cys-Gys contacts to avoid com- 
plication for comparison. Table IVIV shows that Hgc 
can only recognize two native structures out of nine 
proteins in the union set of the 4-state decoy set and 
the lattice-ssfit decoy set. In addition, the z scores for 
native structures are generally higher when using Ha 
than Hgc- For the 4-state decoy set, the correlation 
coefficient p by Ha is also higher. These results indi- 
cate that geometric description of protein structures 
are important, and contact model by alpha shape are 
more accurate with more discriminative information 
for identifying native-like structures. However, the 
cut-off distance approach may be more convenient to 
implement, and it is possible to gain further improve- 
ment by setting the values of the cut-off threshold as 
a variable depending on the type of contacting residue 
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a b 



Figure 5: Alpha contact maps provide accurate geometric de- 
scriptions, a) Two Val residues are nearly parallel to each other. 
Their geometric centers are close enough, but no physical contacts 
occur between them, b) Two Lys are oriented linearly in opposite 
direction. Their geometric centers are far away from each other, 
but a hydrogen bond forms between O from one Lys and N from 
the other Lys. Alpha contact definition correctly identify a) as "in 
contact" and b) as not "in contact". Contact model by distance 
cut-off of geometric centers of side chains give different contact 
assignment in both cases. 



pairs. 



Comparison with Other Potentials. We further 
compare alpha contact potential with several previ- 
ously developed residue contact potentials, includ- 
ing MJ [54], BT [55], SK [33], and TD [9] poten- 
tials. We take the values of MJ potential from Ta- 
ble 3 of reference [54]. Following the author's rec- 
ommendation, the average hydrophobicity (e^r = - 
2.55) was subtracted from the potential, and a pair 
of residues is declared to be in contact if the geomet- 
ric centers of their side chain is within an interval of 
2.0 A to 6.5 A. The values of BT potential is taken 
from [55] and was obtained by rescaling MJ potentials 
with Thr as the reference solvent [55]. The correla- 
tion coefficients p and dispersions [55] between each 
pair of potentials are shown in table Table VV. The 
correlation coefficients p for alpha contact potential 
logP(i,j) and these residue contact potentials are: 
p = 0.66,0.80,0.61, and 0.66 for MJ, BT, SK, and 
TD potentials, respectively. The dispersion as de- 
fined in [55] (p.363. Formula 4) are 1.45,0.28,0.51, 
and 0.39, respectively. Because the contact map ob- 
tained by alpha edges can be substantially different 
from other contact definition (Fig 4), the absolute 
value of energy by alpha contact potential and by 
other potentials for the same structure can be sub- 
stantially different. 



Long Range Interactions. Interactions between 
residues with large sequence distance d are relatively 
rare. We found that they occur more likely in the 
interior of a protein than on the surface (data not 
shown). Identifying such interactions are of particu- 
lar interest, because these interactions result in sig- 
nificant reduction of conformational entropy. Pre- 
diction of protein structures seem to be most diffi- 
cult for those with large contact order [56], namely, 
those with significant amount of interactions between 
residues with large sequence distances. 

The bootstrap procedure introduced here provides 
a method to reliably identify the contact pairs of 
long sequence separation whose propensity values can 
be confidently assessed (see Supplementary Online 
Material for tables of alpha contact potential for 
d > 30). Among all possible 210 pairwise interac- 
tions, 9 contact pairs with high propensity (lower 
value of 95% confidence interval > 1.5) can be reli- 
ably assessed for d > 30. In addition to Cys-Cys, they 
include hydrophobic-hydrophobic interactions (Gly- 
Gly, Met-Met, Ile-Ile, Phe-Phe, Val- Val, Met-Phe), 
salt- bridge interactions (Arg-Asp, Asp-Lys), and Pro- 
Trp. 

Some long range interactions are clearly associ- 
ated with specific secondary structures. After cor- 
rection for prior probability to be in a particular 
secondary structure, we found that Met-Met con- 
tact has a high propensity to occur between two he- 
lices (h) or two beta-strands (s) {P{Met, Met)hh = 
2A,P{Met,Met)ss = 2.3), and a low propen- 
sity to occur between either a helix and a coil 
(c) {P{Met,Met)hc = 0.65), or a strand and a 
coil {P{Met, Met)sc — 0.56). Similarly, because 
Gly is a helix breaker, long-range Gly-Gly contact 
has a high propensity to occur between two coils 
[P {Gly , Gly) cc = 3.2), and low propensity to occur 
between two helices {P{Gly,Gly)hh = 0.53). 

Reduced Alphabet for Amino Acid Residues. 

The clustering of pairwise alpha contact potentials 
shown in Figure 2 suggests that many residues be- 
have similarly in contact interactions. This points 
to possible degeneracy of the amino acid alphabet 
[16,17]. Reduced alphabet is important because a 
smaller size of alphabet will lead to exponentially 
more efficient sampling methods in sequence design 
and protein engineering [57-61]. Many experimental 
and computational studies already suggested that a 
minimum number of amino acid residue types far less 
than 20 may be adequate for protein folding [40,62- 
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Alpha 


MJ 


BT 


GC 


SK 


TD 


Alpha 


1/0 


0.66/1.45 


0.80/0.28 


0.77/0.41 


0.61/0.51 


0.66/0.39 


MJ 


0.66/1.45 


1/0 


0.66/1.43 


0.37/1.60 


0.73/1.29 


0.67/1.35 


BT 


0.80/0.25 


0.66/1.43 


1/0 


0.49/0.56 


0.76/0.41 


0.63/0.37 


GC 


0.77/0.41 


0.37/1.60 


0.49/0.56 


1/0 


0.15/0.81 


0.43/0.63 


SK 


0.61/0.55 


0.76/1.29 


0.82/0.41 


0.15/0.81 


1/0 


0.64/0.52 


TD 


0.66/0.39 


0.67/1.35 


0.63/0.37 


0.43/0.63 


0.64/0.52 


1/0 



Table V: The correlation coefficients and dispersions [55] between each pair of potentials. The first number of each cell 
is the correlation coefficient between each pair of potentials. The second number is the dispersion between each pair of 
potentials. 



64]. Wang and Wang examined diflferent ways to re- 
duced MJ interaction matrix and concluded that by 
minimizing mismatches, a reduced alphabet of just 
five amino acid residue types can be used to construct 
sequences with good foldability and kinetic accessi- 
bility [65]. The reduced 5 alphabet set coincide with 
the same alphabet set reported in the work of Rid- 
dle et al, where fully functional constructs for a small 
57-residue beta-barrel protein can be experimentally 
obtained when residues in 38 out of 40 selected amino 
residues are drawn from the alphabet set of I, K, E, 
A, and G. Murphy et al further examined reduced 
alphabets based on BLOSUM50 substitution matrix 
[41]. When using a variety of reduced alphabets with 
size ranging from 10 to 20, they found that there is 
little loss of the information necessary to select struc- 
tural homologs in a database of representative protein 
sequences using dynamic programming based global 
alignment. 

We continue investigation in this direction and 
study the capability of various reduced alphabet sets 
in discriminating native proteins from decoys. Fig 2 
provides a natural way of reducing the residue al- 
phabet set that is similar to the approach used by 
Murphy et al [41]. By placing a horizontal line at 
different vertical heights, we can obtain a reduced 
residue alphabet that is determined by the heights 
of the branching points in the dendrogram from the 
hierarchical clustering. For example, we can have 
an alphabet of 7 residue types at height about 1.5: 
A = {D,E}, B = {R,K}, C= {S,T, N, Q, H, P}, 
V = {V, I, L, M, F, W}, £ = {W, Y}, T = {A, 
G}, and Q = {C}. An alphabet of two residue types 
would take Cys as a residue type, and everything else 
grouped into another residue type. An alphabet of 
three residue types would have Cys, polar residues, 
and hydrophobic residues. 

Does reduced alphabet still capture the basic in- 




2 4 6 8 



RMSD 

Figure 6: Energy evaluated by alpha contact potential plot- 
ted against the RMSD to native structures for conformations 
in Park& Levitt Decoy Set. The alphabet of residues has 9 
types of amino acids. The discrimination is similar to that 
shown in Fig 3. 4icb is denoted by and has the lowest 
energy. 



formation of protein contact interactions? We use 
Equation 1 to estimate pairwise alpha contact poten- 
tials for reduced alphabet size of 2, 3, 5, 7, 8, 9, 10, 
11, 15, and 20, and test their effectiveness in selecting 
native structure from decoys in the Park and Levitt 
data set. Fig 6 shows the result when an alphabet set 
of 9 residue types, plus the three types of Cys-Cys 
contacts are used. Remarkably, the discrimination of 
native conformation from decoys is almost as good as 
when there are 20 different residue types. 

Figure 7 shows the detailed results of discriminat- 
ing native structure of 3icb from decoys using poten- 
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* E 


Ictf 


lr69 


lsn3 


2cro 


3icb 


4pti 


4rxn 


2 


0.31 


0.35 


0.33 


-0.15 


0.13 


0.40 


0.17 


3 


0.39 


0.44 


0.48 


0.34 


0.73 


0.44 


0.39 


5 


0.41 


0.47 


0.47 


0.38 


0.71 


0.47 


0.51 


7 


0.45 


0.44 


0.48 


0.46 


0.71 


0.49 


0.48 


8 


0.54 


0.46 


0.42 


0.48 


0.68 


0.52 


0.50 


9 


0.47 


0.48 


0.47 


0.47 


0.72 


0.47 


0.49 


10 


0.47 


0.49 


0.47 


0.48 


0.71 


0.48 


0.51 


11 


0.47 


0.50 


0.49 


0.47 


0.72 


0.49 


0.51 


15 


0.49 


0.50 


0.49 


0.50 


0.71 


0.47 


0.49 


20 


0.49 


0.52 


0.49 


0.50 


0.74 


0.46 


0.53 



Table VI: Correlation coefficients p between energy evaluated using alpha contact potential and RMSD to the native 
structures for decoys in the Park and Levitt Set. Alpha contact potentials with alphabet containing different number of 
residue classes are used. 

*|E| denotes the number of residue classes in the alphabet E as obtained from Fig 2. 



tials derived from different alphabet sets. The aver- 
age RMSDs of n decoys with lowest energy by poten- 
tials of different alphabets are calculated. Smaller av- 
erage RMSD of these lowest energy decoys indicates 
that a large fraction of them are near-native struc- 
tures. This would suggest good discrimination. The 
top n = 100 decoys of lowest energy are all found to 
have very similar average RMSD to the native struc- 
ture, regardless the size of the alphabet. This sug- 
gests that alphabet with just a few residues have re- 
spectable results in decoy discrimination. Table VIVI 
further shows the correlation coefficient of energy and 
RMSD for all proteins in the Park and Levitt data set 
when using different alphabets. The results indicate 
that an alphabet of 7 residues would have very sim- 
ilar performance as an alphabet with 20 residues in 
discriminating decoys. Our results extended earlier 
work where subjectively defined alphabet sets were 
used to extract contact residue potentials by an iter- 
ative optimization method [9]. It was found that po- 
tential derived using such reduced alphabets were ef- 
fective in discriminating decoys generated by gapless 
threading. Here we showed that similar conclusion 
can be drawn for statistical potential using alphabet 
sets derived from natural clustering of residues, in dis- 
criminating natives against more stringent compact 
decoy conformations generated by off-lattice model 
[43]. Our conclusion is also consistent with a recent 
study where it is shown that much of the informa- 
tion in pairwise contact potential is related to just 
a few variables such as hydropathy, charge, disulfide 
bonding, and residue burial [66]. 

Although the alphabets we used have different 
number of residues, they are all developed with one 



aspect in common, i.e., the contacts are all derived 
from the dual simplicial complexes, which provides 
a faithful representation of the geometry. This sug- 
gests that as long as the same space filling pattern is 
conserved, the specific residue types are not critical 
in many cases. It seems that packing geometry plays 
a very important role, but the specific residue types 
are often replaceable. This observation is consistent 
with experimental results where it is well known that 
proteins are robust against many mutations. 



Edge Simplices and Tetrahedron Simplices. 

Pairwise alpha contact potential only considers the 
edge simplices or 1-simplices in the dual simplicial 
complex. There have been several studies of sta- 
tistical potential based on 3-simplices or tetrahedra 
[67-69]. In the work of Tropsha et al, 3-simplices 
are obtained from unweighted Delaunay triangulation 
[67,68]. In these studies, all residues are treated as 
balls of equal size located at or C/3 positions, and 
a cut-off distance is used to remove tetrahedra that 
are considered to be too large. Our approach is dif- 
ferent. Our model is instead based on the weighted 
dual simplicial complex, a different simplicial com- 
plex formed by a subset of the simplices from the 
weighted Delaunay triangulation of all atoms in the 
molecule. The dual simplicial complex or the alpha 
shape allows modeling at atomic level. Therefore, in 
our approach contacts can be defined by the full side 
chain and main chain atoms. Additionally, atoms 
are assigned with appropriate non-uniform van der 
Waals radii [31]. Finally, because the dual simpli- 
cial complex reflects the precise contact geometry, we 
avoid the use of heuristic cut-off thresholds necessary 



14 



5 Acknowledgment 




OJ - 



100 200 300 400 500 600 
Decoys Ranked by Energy 

Figure 7: Discriminating native structure of 3icb from de- 
coys using different alphabets. For each alphabet, we rank 
the decoys by their energy. The average RMSD to the native 
structure is then calculated for the top n decoys at different 
n values. Top dashed line represents the average RMSD if 
the decoys are chosen randomly; Dotted line represents the 
ideal case that n decoys with smallest RMSD are chosen 
by an ideal function; The other lines represent the average 
RMSD of decoys chosen by potentials of an alphabet with 
2-20 amino acid types. 

to eliminate a subset of the simpliccs from the De- 
launay triangulation. Our contacts represents accu- 
rately geometry of the structure. We discussed earlier 
the difFcrences between the alpha contact and contact 
by distance cut-off between geometric centers of side 
chains. It is conceivable that similar difference will 
result between alpha contact and the approach de- 
scribed in [67, 68]. 

Summary. In this work, we introduced a novel rep- 
resentation of protein structures using edge simplices 
of the alpha shape, or the dual simplicial complex of 
the protein structure. By describing pairwise contact 
interactions with simplicial edges, we developed alpha 
contact potential based on the statistics of edge sim- 
plices. We also developed a bootstrap model which 
provides confidence interval estimations, including 
those of long range interactions. We found that alpha 
contact potential performs well in decoy structure dis- 
crimination. By comparing with alternative contact 
potential, we conclude that geometric representation 
of contact interaction is important, but the specific 
residue types are often interchangeable. 
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